This is an analysis of Capital Bikeshare data from Washington D.C. over two years.
“Bike sharing systems are a new generation of traditional bike rentals where the whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back to another position. Currently, there are about over 500 bike-sharing programs around the world which are composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
Apart from interesting real-world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.” - Kaggle
Imports
suppressMessages(library(tswge))
suppressMessages(library(tidyverse))
suppressMessages(library(plotly))
suppressMessages(library(xts))
suppressMessages(library(vars))
suppressMessages(library(nnfor))
## Warning: package 'nnfor' was built under R version 3.5.2
## Warning: package 'forecast' was built under R version 3.5.2
Mission Statement
We are fictional members of the Washington D.C. Department of Transportation, and we are interested in bike usage patterns in the city for our congestion and events initiative. We were able to find this bike share data, and we assume the results of our analysis will generalize to non-bike-share members. We understand this is purely a case study and it may not accurately represent the entire biking population of Washington D.C.
We have major road construction happening in the next six months and need to report the expected impact on the biking populace over this time. The feature we would like to predict is “Total Users”, and the horizon we will be predicting over is the next six months. Since we are interested in longer-term, gross trends, we are reducing our data from an hourly sample rate to a daily average.
Load Data
bike_data = read_csv('/Users/kaileyayala/Documents/Learning/SMU/Time Series/Time Series Bike Project/bike data.csv')
head(bike_data)
## # A tibble: 6 x 14
## Date Season Hour Holiday `Day of the Wee… `Working Day` `Weather Type`
## <chr> <int> <int> <int> <int> <int> <int>
## 1 1/1/… 4 0 0 6 0 1
## 2 1/1/… 4 1 0 6 0 1
## 3 1/1/… 4 2 0 6 0 1
## 4 1/1/… 4 3 0 6 0 1
## 5 1/1/… 4 4 0 6 0 1
## 6 1/1/… 4 5 0 6 0 2
## # ... with 7 more variables: `Temperature F` <dbl>, `Temperature Feels
## # F` <dbl>, Humidity <int>, `Wind Speed` <int>, `Casual Users` <int>,
## # `Registered Users` <int>, `Total Users` <int>
We can see above that we have a variety of features to work with: Date, Season, Hour, Holiday, Day of the Week, Working Day, Weather Type, Temperature F, Temperature Feels F, Humidity, Wind Speed, Casual Users, Registered Users, and Total Users. Since we are trying to forecast, the features we will retain are the ones we can reasonably expect to know in advance. We won’t be building models for these features; instead we will assume we already have them on hand.
The features we will be using for our analysis are Temperature Feels F, a metric of what the temperature feels like, and Wind Speed. Since Temperature Feels F is going to be collinear with temperature and humidity, those two have been omitted.
daily_avg = aggregate(bike_data[c("Total Users", "Wind Speed", "Temperature Feels F")],
                      by = list(bike_data$Date), FUN = mean)
# Re-order the rows chronologically (dates parse as m/d/Y)
daily_avg = daily_avg[order(as.Date(daily_avg$Group.1, format = "%m/%d/%Y")), ]
rownames(daily_avg) = 1:nrow(daily_avg)
colnames(daily_avg) = c("Date", "Total Users", "Wind Speed", "Temp. Feels F")
daily_avg = subset(daily_avg, select = c("Temp. Feels F", "Wind Speed", "Total Users"))
plot.ts(daily_avg)

bike_data = daily_avg
# Train on roughly the first 18 months; hold out the last 182 days (six months).
# Note that row 549 falls between the two splits and is used by neither.
bikeTrain = bike_data[1:548, ]
bikeTest  = bike_data[550:731, ]
head(bike_data)
## Temp. Feels F Wind Speed Total Users
## 1 46.40000 10.75000 41.04167
## 2 45.22609 16.65217 34.82609
## 3 25.70000 16.63636 61.31818
## 4 28.40000 10.73913 67.91304
## 5 30.43478 12.52174 69.56522
## 6 30.90435 6.00000 69.82609
Check for Stationarity
To determine which models are most reasonable, we must inspect stationarity.
Total_Users = bike_data$`Total Users`
plot_ = plotts.sample.wge(Total_Users)

first_half = bike_data$`Total Users`[1:(length(bike_data$`Total Users`)/2)]
second_half = bike_data$`Total Users`[(length(bike_data$`Total Users`)/2+1):(length(bike_data$`Total Users`))]
acf(first_half)
acf(second_half)

The mean of the realization appears to vary with time: there looks to be seasonality on the order of roughly one year, and that seasonality makes the change in trend over time more apparent. However, we can’t be sure whether this is trend or just wandering, especially since our dataset is so short. Because of the nature of the dataset we have only one realization of total users, so we can’t average realizations to get a better estimate either. This realization appears to be non-stationary, but we will investigate both stationary and non-stationary models.
It’s hard to tell if there is change in variance over time since we have only one realization. It appears there may be smaller variance at the beginning of the realization and larger variance towards the end.
The slow decay of the ACF is indicative of non-stationarity, and the ACFs of the first and second halves differ slightly, which is another indicator of non-stationary data.
We can see from the spectral density there is a peak at zero which is indicative of wandering.
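This low-frequency signature is easy to reproduce with base R alone (stats::spectrum, no tswge needed). A sketch on simulated data, not the bike series; the variable names here are our own:

```r
# Sketch (base R only): the periodogram of a random walk piles its power
# near frequency zero, the "wandering" signature described above, while
# white noise has a roughly flat spectrum.
set.seed(42)
wn <- rnorm(500)            # white noise
rw <- cumsum(rnorm(500))    # random walk

sp_wn <- spectrum(wn, plot = FALSE)
sp_rw <- spectrum(rw, plot = FALSE)

# Frequency at which each estimated spectrum peaks
peak_wn <- sp_wn$freq[which.max(sp_wn$spec)]
peak_rw <- sp_rw$freq[which.max(sp_rw$spec)]
peak_rw  # very close to zero for the random walk
```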
ARMA Model
We want to see what an ARMA model does with this data.
# ARMA Model:
aic5.wge(bikeTrain$"Total Users") # picks p=2, q=1
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 8 2 1 7.146347
## 6 1 2 7.148354
## 11 3 1 7.148890
## 14 4 1 7.149948
## 15 4 2 7.150039
aic5.wge(bikeTrain$"Total Users", type='bic') # picks p=2, q=1
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of bic
## p q bic
## 8 2 1 7.177780
## 6 1 2 7.179786
## 11 3 1 7.188181
## 9 2 2 7.191442
## 14 4 1 7.197097
# both AIC and BIC agree on ARMA(2,1)
est_bike_arma = est.arma.wge(bikeTrain$`Total Users`, p=2, q=1)
##
## Coefficients of Original polynomial:
## 1.2488 -0.2504
##
## Factor Roots Abs Recip System Freq
## 1-0.9979B 1.0021 0.9979 0.0000
## 1-0.2509B 3.9855 0.2509 0.0000
##
##
plot_ = plotts.sample.wge(est_bike_arma$res, arlimits = TRUE)

We can see the ACF looks like white noise, though the realization of the residuals looks like it still has something in it: the variance of the residuals appears to depend on time. The spectral density no longer shows much wandering, though it still has more large peaks than we would expect from white noise.
ljung24 = ljung.wge(est_bike_arma$res,p=2,q=1)
## Obs 0.0004038083 -0.02554771 -0.04673648 -0.03274126 -0.04670617 0.09935826 -0.03411745 0.02792229 -0.03714409 -0.01229535 -0.006755202 0.01268319 -0.002017354 0.04141772 0.02628097 -0.01700396 0.01715238 0.01055376 -0.03995274 0.01272378 0.07013586 0.03966986 -0.01047915 -0.01263622
ljung48 = ljung.wge(est_bike_arma$res,p=2,q=1,K=48)
## Obs 0.0004038083 -0.02554771 -0.04673648 -0.03274126 -0.04670617 0.09935826 -0.03411745 0.02792229 -0.03714409 -0.01229535 -0.006755202 0.01268319 -0.002017354 0.04141772 0.02628097 -0.01700396 0.01715238 0.01055376 -0.03995274 0.01272378 0.07013586 0.03966986 -0.01047915 -0.01263622 -0.02509604 0.03989361 0.0219746 0.08069796 0.1305351 -0.09480874 -0.07180454 -0.105316 0.02683909 0.01314762 0.0671599 -0.01793263 0.02225972 -0.07038853 0.04325937 -0.01774715 0.01868202 -0.04873514 0.05160783 -0.008351998 -0.005114024 0.008506614 0.0008044217 0.004264309
ljung24$pval # 0.677539
## [1] 0.677539
ljung48$pval # 0.08269171
## [1] 0.08269171
The Ljung-Box test with K = 24 and K = 48 gives p-values of 0.68 and 0.08 respectively. At the 0.05 level there is not sufficient evidence to reject the null hypothesis of white noise, so we don’t have evidence that the residuals are correlated.
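For reference, the same test is available in base R as stats::Box.test, with fitdf = p + q adjusting the degrees of freedom for the fitted ARMA(2,1) as ljung.wge does. A self-contained sketch on simulated noise standing in for est_bike_arma$res:

```r
# Base-R cross-check of the Ljung-Box test (stats::Box.test).
# `res` is simulated white noise here so the sketch runs standalone.
set.seed(1)
res <- rnorm(548)  # stand-in for est_bike_arma$res

lb24 <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 3)
lb48 <- Box.test(res, lag = 48, type = "Ljung-Box", fitdf = 3)
c(K24 = lb24$p.value, K48 = lb48$p.value)
```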
for_bike_arma= fore.arma.wge(bikeTrain$`Total Users`, phi=est_bike_arma$phi, theta = est_bike_arma$theta, n.ahead= 182 , limits = T , lastn=F )

The forecast does not look representative of the data. The gradual return to the mean is expected from this kind of model, but right off the bat it looks like we could do a lot better with other kinds of models.
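This mean reversion is intrinsic to stationary ARMA forecasts, not a quirk of this fit. A minimal illustration with base R’s arima and predict on a simulated AR(1) series (not the bike data):

```r
# Stationary ARMA forecasts decay geometrically back to the process mean.
set.seed(7)
x <- arima.sim(list(ar = 0.9), n = 500) + 50   # AR(1) around a mean of 50
fit <- arima(x, order = c(1, 0, 0))
fc <- predict(fit, n.ahead = 200)$pred

# Long-horizon forecasts converge to the fitted mean ("intercept")
tail(fc, 1)
coef(fit)["intercept"]
```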
total_users = bike_data$`Total Users`
len_total_users = length(total_users)
plot(x = seq(len_total_users - 182, len_total_users),
     y = total_users[(len_total_users - 182):len_total_users], type = 'l')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_arma$f,
      type = 'l', col = 'blue')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_arma$ll,
      type = 'l', col = 'blue')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_arma$ul,
      type = 'l', col = 'blue')

This is a close-up of the forecast. We can see the confidence intervals are very wide and the forecast doesn’t follow the realization very well.
ASE_ARMA21 = mean((for_bike_arma$f - bike_data$`Total Users`[(length(bike_data$`Total Users`) -182+1 ): length(bike_data$`Total Users`)])^2)
ASE_ARMA21
## [1] 4526.317
Our ASE from this model is 4526. This is now our benchmark for the other models’ performance.
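Since ASE is the comparison metric for every model below, a small helper avoids re-typing the indexing. The name `ase` and its signature are our own, not part of tswge:

```r
# Mean squared error over the held-out horizon ("ASE" throughout this report).
# `actual` and `forecast` must be equal-length numeric vectors.
ase <- function(actual, forecast) {
  stopifnot(length(actual) == length(forecast))
  mean((actual - forecast)^2)
}

ase(c(10, 20, 30), c(12, 18, 33))  # (4 + 4 + 9) / 3
```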
ARIMA Model
Now we want to try a non-stationary model. Let’s see what an ARIMA does with this data.
### ARIMA model
bike_tr1 = artrans.wge(bikeTrain$`Total Users`, phi = 1)

The first difference removed the slowly damping behavior from the autocorrelations very well.
aic_ARIMA = aic5.wge(bike_tr1) # picks p=1, q=1
bic_ARIMA = aic5.wge(bike_tr1, type='bic') # picks p=1, q=1
aic_ARIMA
## p q aic
## 5 1 1 7.140812
## 3 0 2 7.142006
## 8 2 1 7.143583
## 18 5 2 7.144420
## 6 1 2 7.144445
bic_ARIMA
## p q bic
## 5 1 1 7.164420
## 3 0 2 7.165613
## 8 2 1 7.175059
## 6 1 2 7.175921
## 11 3 1 7.184375
#both AIC and BIC pick ARMA(1,1)
Both the AIC and BIC chose p=1 and q=1.
est_bike_arima = est.arma.wge(bike_tr1, p=1, q=1)
##
## Coefficients of Original polynomial:
## 0.2568
##
## Factor Roots Abs Recip System Freq
## 1-0.2568B 3.8938 0.2568 0.0000
##
##
Residual_Diagnostic_Plot_ARIMA = plotts.sample.wge(est_bike_arima$res, arlimits = TRUE)

ljung24 = ljung.wge(est_bike_arima$res,p=1,q=1)
## Obs 0.005474324 -0.01659523 -0.0379841 -0.02492649 -0.040171 0.1043309 -0.02976697 0.03216769 -0.03225387 -0.008003643 -0.003167844 0.01590497 0.000945492 0.04342421 0.02831104 -0.01477115 0.01976887 0.01337935 -0.03692317 0.01436112 0.07082101 0.04074232 -0.008427779 -0.01009065
ljung48 = ljung.wge(est_bike_arima$res,p=1,q=1,K=48)
## Obs 0.005474324 -0.01659523 -0.0379841 -0.02492649 -0.040171 0.1043309 -0.02976697 0.03216769 -0.03225387 -0.008003643 -0.003167844 0.01590497 0.000945492 0.04342421 0.02831104 -0.01477115 0.01976887 0.01337935 -0.03692317 0.01436112 0.07082101 0.04074232 -0.008427779 -0.01009065 -0.02397844 0.04146099 0.02288407 0.08090141 0.1309639 -0.09212191 -0.06918511 -0.102822 0.02827954 0.01391741 0.06696824 -0.01676866 0.02352513 -0.06895964 0.04455764 -0.01663361 0.02018504 -0.04787052 0.05249205 -0.006820387 -0.003344472 0.01062969 0.003192186 0.006518569
print(ljung24$pval) # 0.7718675
## [1] 0.7718675
print(ljung48$pval) # 0.1225847
## [1] 0.1225847
The Ljung-Box test with K = 24 and K = 48 gives p-values of 0.77 and 0.12 respectively, so we again fail to reject the null hypothesis of white noise for these residuals.
for_bike_arima= fore.aruma.wge(bikeTrain$`Total Users`, phi=est_bike_arima$phi,
theta = est_bike_arima$theta, d=1,
n.ahead= 182 , limits = T , lastn=F )

total_users = bike_data$`Total Users`
len_total_users = length(total_users)
plot(x = seq(len_total_users - 182, len_total_users),
     y = total_users[(len_total_users - 182):len_total_users], type = 'l')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_arima$f,
      type = 'l', col = 'blue')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_arima$ul,
      type = 'l', col = 'blue')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_arima$ll,
      type = 'l', col = 'blue')

Again we can see the forecast doesn’t match the realization very well.
ASE2_ARIMA111 = mean((for_bike_arima$f - total_users[(len_total_users -182+1 ): len_total_users])^2)
ASE2_ARIMA111
## [1] 5378.435
The ASE for this ARIMA(1,1,1) model is 5378, even larger than the ASE for the ARMA model.
ARUMA Model
Now we want to try a seasonal non-stationary model. Let’s see what an ARUMA does with this data.
### ARUMA model
bike_tr365 = artrans.wge(bikeTrain$`Total Users`, phi.tr = c(rep(0,364),1))

bike_tr365.1 = artrans.wge(bike_tr365, phi = 1)

The seasonal and first differences removed the slowly damping behavior from the autocorrelations very well.
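The two artrans.wge calls above are just a lag-365 seasonal difference followed by a first difference; in base R that composition is diff(diff(x, lag = 365)). A sketch on a toy series with an annual cycle plus trend, not the bike data:

```r
# Seasonal (lag 365) then first difference, base R equivalent of
# artrans.wge(x, phi.tr = c(rep(0, 364), 1)) followed by artrans.wge(., phi = 1).
x <- sin(2 * pi * (1:800) / 365) + 0.01 * (1:800)  # annual cycle + linear trend

d365   <- diff(x, lag = 365)  # removes the period-365 cycle (leaves a constant)
d365_1 <- diff(d365)          # removes the remaining linear trend

length(d365_1)  # 800 - 365 - 1 = 434 observations remain
```

Each difference consumes observations, which is one reason seasonal models are costly on a dataset this short.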
bic_ARUMA = aic5.wge(bike_tr365.1, type='bic') # picks p=0, q=1
bic_ARUMA
## p q bic
## 2 0 1 7.890057
## 3 0 2 7.902506
## 5 1 1 7.903536
## 8 2 1 7.929717
## 6 1 2 7.931086
The BIC chose p=0 and q=1.
est_bike_aruma = est.arma.wge(bike_tr365.1, p=0, q=1)
Residual_Diagnostic_Plot_ARUMA = plotts.sample.wge(est_bike_aruma$res, arlimits = TRUE)

ljung24 = ljung.wge(est_bike_aruma$res,p=0,q=1)
## Obs 0.09915153 -0.06750875 -0.113931 -0.03214805 -0.04319667 0.2165408 -0.04262793 0.03160498 0.007324697 -0.08718502 -0.1166617 0.07345292 -0.008094529 0.04394723 -0.01557514 0.03811541 0.02207544 -0.04132964 -0.08267888 0.02998653 0.09608286 0.07312497 0.007524756 -0.1735543
ljung48 = ljung.wge(est_bike_aruma$res,p=0,q=1,K=48)
## Obs 0.09915153 -0.06750875 -0.113931 -0.03214805 -0.04319667 0.2165408 -0.04262793 0.03160498 0.007324697 -0.08718502 -0.1166617 0.07345292 -0.008094529 0.04394723 -0.01557514 0.03811541 0.02207544 -0.04132964 -0.08267888 0.02998653 0.09608286 0.07312497 0.007524756 -0.1735543 -0.1237941 -0.02731696 0.01183997 0.17362 0.1647008 -0.1351105 -0.0819193 -0.1425874 -0.004885676 0.03334339 0.07613064 -0.1158271 -0.01398124 -0.09569558 0.001470974 -0.06574228 -0.01230819 0.02559737 0.03948206 -0.04834293 0.0124064 -0.08563702 -0.07923627 -0.1155405
print(ljung24$pval) # 0.08949701
## [1] 0.08949701
print(ljung48$pval) # 0.006830042
## [1] 0.006830042
The Ljung-Box test gives p-values of 0.089 at K = 24 and 0.007 at K = 48. At K = 48 this is strong evidence to reject the null hypothesis of white noise, so these residuals appear to still be correlated.
for_bike_aruma = fore.aruma.wge(bikeTrain$`Total Users`, phi = est_bike_aruma$phi,
                                theta = est_bike_aruma$theta, d = 1, s = 365,
                                n.ahead = 182, limits = T, lastn = F)

total_users = bike_data$`Total Users`
len_total_users = length(total_users)
plot(x = seq(len_total_users - 182, len_total_users),
     y = total_users[(len_total_users - 182):len_total_users], type = 'l')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_aruma$f,
      type = 'l', col = 'blue')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_aruma$ul,
      type = 'l', col = 'blue')
lines(x = seq(len_total_users - 182 + 1, len_total_users), y = for_bike_aruma$ll,
      type = 'l', col = 'blue')

Again we can see the forecast doesn’t match the realization very well.
ASE2_ARUMA011 = mean((for_bike_aruma$f - total_users[(len_total_users -182+1 ): len_total_users])^2)
ASE2_ARUMA011
## [1] 5211.796
The ASE for this ARUMA(0,1,1) with s = 365 model is 5212. This is smaller than the ARIMA’s, but still larger than the ASE for the ARMA model.
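Collecting the three ASEs makes the ranking explicit. The values below are copied from the outputs above; the `results` data frame is our own bookkeeping, not tswge output:

```r
# Hold-out ASE by model, copied from the results above.
results <- data.frame(
  model = c("ARMA(2,1)", "ARIMA(1,1,1)", "ARUMA(0,1,1) s=365"),
  ase   = c(4526.317, 5378.435, 5211.796)
)
results[order(results$ase), ]
results$model[which.min(results$ase)]  # the ARMA(2,1) benchmark leads so far
```

A low ASE alone doesn’t decide the final model; residual diagnostics and how reasonable the forecast shape looks over the six-month horizon matter as well.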